74 research outputs found
Multicolumn Networks for Face Recognition
The objective of this work is set-based face recognition, i.e. to decide if
two sets of images of a face are of the same person or not. Conventionally, the
set-wise feature descriptor is computed as an average of the descriptors from
individual face images within the set. In this paper, we design a neural
network architecture that learns to aggregate based on both "visual" quality
(resolution, illumination), and "content" quality (relative importance for
discriminative classification). To this end, we propose a Multicolumn Network
(MN) that takes a set of images (the number in the set can vary) as input, and
learns to compute a fixed-size feature descriptor for the entire set. To
encourage high-quality representations, each individual input image is first
weighted by its "visual" quality, determined by a self-quality assessment
module, and followed by a dynamic recalibration based on "content" qualities
relative to the other images within the set. Both of these qualities are learnt
implicitly during training for set-wise classification. Compared with the previous state-of-the-art architectures trained on the same dataset (VGGFace2), our Multicolumn Networks show an improvement of 2-6% on the
IARPA IJB face recognition benchmarks, and exceed the state of the art for all
methods on these benchmarks.
Comment: To appear in BMVC 2018
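The aggregation scheme above can be sketched with plain arrays. In the MN both qualities are predicted by learnt modules; the explicit `visual_quality` and `content_quality` inputs below are hypothetical stand-ins for those learnt scores, so this is an illustrative sketch rather than the paper's implementation:

```python
import numpy as np

def aggregate_set(descriptors, visual_quality, content_quality):
    """Aggregate per-image descriptors into one fixed-size set descriptor.

    Sketch: each descriptor is first scaled by its own visual quality,
    then recalibrated by a softmax over content qualities, so content
    weights are relative to the other images within the set.
    """
    descriptors = np.asarray(descriptors, dtype=float)  # (n, d)
    vq = np.asarray(visual_quality, dtype=float)        # (n,)
    cq = np.asarray(content_quality, dtype=float)       # (n,)
    weighted = descriptors * vq[:, None]                # visual weighting
    w = np.exp(cq - cq.max())                           # stable softmax
    w /= w.sum()                                        # content recalibration
    return (weighted * w[:, None]).sum(axis=0)          # (d,)
```

The softmax is what makes the content weighting dynamic: adding or removing images from the set changes every image's relative weight, while the visual-quality term stays per-image.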
Comparator Networks
The objective of this work is set-based verification, e.g. to decide if two
sets of images of a face are of the same person or not. The traditional
approach to this problem is to learn to generate a feature vector per image,
aggregate them into one vector to represent the set, and then compute the
cosine similarity between sets. Instead, we design a neural network
architecture that can directly learn set-wise verification. Our contributions
are: (i) We propose a Deep Comparator Network (DCN) that can ingest a pair of
sets (each may contain a variable number of images) as inputs, and compute a
similarity between the pair--this involves attending to multiple discriminative
local regions (landmarks), and comparing local descriptors between pairs of
faces; (ii) To encourage high-quality representations for each set, internal
competition is introduced for recalibration based on the landmark score; (iii)
Inspired by image retrieval, a novel hard sample mining regime is proposed to
control the sampling process, such that the DCN is complementary to the
standard image classification models. Evaluations on the IARPA Janus face
recognition benchmarks show that the comparator networks outperform the
previous state-of-the-art results by a large margin.
Comment: To appear in ECCV 2018
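A minimal sketch of the "attend, then compare" step, assuming each set has already been reduced to K landmark-aligned local descriptors (the DCN's attention and recalibration modules are learnt and are not reproduced here):

```python
import numpy as np

def set_similarity(local_desc_a, local_desc_b):
    """Compare two face sets via landmark-aligned local descriptors.

    Sketch: each set is represented by K local descriptors, one per
    attended landmark; the set similarity is the mean cosine similarity
    over matching landmarks.
    """
    a = np.asarray(local_desc_a, dtype=float)  # (K, d)
    b = np.asarray(local_desc_b, dtype=float)  # (K, d)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a * b).sum(axis=1).mean())
```

Comparing per-landmark descriptors, rather than one pooled vector per set, is what lets the comparison focus on local discriminative regions.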
Diagnosing Human-object Interaction Detectors
Although we have witnessed significant progress in human-object interaction
(HOI) detection with increasingly high mAP (mean Average Precision), a single
mAP score is too concise to obtain an informative summary of a model's
performance and to understand why one approach is better than another. In this
paper, we introduce a diagnosis toolbox for analyzing the error sources of the
existing HOI detection models. We first conduct holistic investigations in the
pipeline of HOI detection, consisting of human-object pair detection and then
interaction classification. We define a set of errors and the oracles to fix
each of them. By measuring the mAP improvement obtained from fixing an error
using its oracle, we can have a detailed analysis of the significance of
different errors. We then delve into the human-object detection and interaction
classification, respectively, and check the model's behavior. For the first
detection task, we investigate both recall and precision, measuring the
coverage of ground-truth human-object pairs as well as the noisiness level in
the detections. For the second classification task, we compute mAP for
interaction classification only, without considering the detection scores. We
also measure the performance of the models in differentiating human-object
pairs with and without actual interactions using the AP (Average Precision)
score. Our toolbox is applicable to different methods across different datasets, and is available at https://github.com/neu-vi/Diag-HOI.
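The oracle methodology can be illustrated on a ranked detection list. The toolbox's error taxonomy and oracles are richer than this; `oracle_gain` below is a simplified, hypothetical sketch in which an oracle either removes or relabels the detections tagged with one error source and reports the resulting AP change (treating the number of correct detections as the positive count, a simplification of true mAP):

```python
import numpy as np

def average_precision(scores, labels):
    """AP over a ranked list: labels are 1 (correct) / 0 (error)."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)
    precision = hits / np.arange(1, len(labels) + 1)
    n_pos = labels.sum()
    return float((precision * labels).sum() / n_pos) if n_pos else 0.0

def oracle_gain(scores, labels, error_type, errors, fix):
    """AP gain from fixing one error source with its oracle.

    `errors` tags each detection with its error source (or None);
    the oracle `fix` either drops those detections ('remove') or
    relabels them as correct ('correct').
    """
    base = average_precision(scores, labels)
    keep = [e != error_type or fix == 'correct' for e in errors]
    fixed = [1 if e == error_type and fix == 'correct' else l
             for l, e in zip(labels, errors)]
    s = [s_ for s_, k in zip(scores, keep) if k]
    l = [l_ for l_, k in zip(fixed, keep) if k]
    return average_precision(s, l) - base
```

Measuring each error source's gain in isolation is what turns a single mAP number into a ranked list of where a detector actually loses accuracy.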
VGGFace2: A dataset for recognising faces across pose and age
In this paper, we introduce a new large-scale face dataset named VGGFace2.
The dataset contains 3.31 million images of 9131 subjects, with an average of
362.6 images for each subject. Images are downloaded from Google Image Search
and have large variations in pose, age, illumination, ethnicity and profession
(e.g. actors, athletes, politicians). The dataset was collected with three
goals in mind: (i) to have both a large number of identities and also a large
number of images for each identity; (ii) to cover a large range of pose, age
and ethnicity; and (iii) to minimize the label noise. We describe how the
dataset was collected, in particular the automated and manual filtering stages
to ensure a high accuracy for the images of each identity. To assess face
recognition performance using the new dataset, we train ResNet-50 (with and
without Squeeze-and-Excitation blocks) Convolutional Neural Networks on
VGGFace2, on MS-Celeb-1M, and on their union, and show that training on
VGGFace2 leads to improved recognition performance over pose and age. Finally,
using the models trained on these datasets, we demonstrate state-of-the-art
performance on all the IARPA Janus face recognition benchmarks, e.g. IJB-A,
IJB-B and IJB-C, exceeding the previous state-of-the-art by a large margin.
Datasets and models are publicly available.
Comment: This paper has been accepted by the IEEE Conference on Automatic Face and Gesture Recognition (F&G), 2018. (Oral)
Feature Tracking Cardiac Magnetic Resonance via Deep Learning and Spline Optimization
Feature tracking Cardiac Magnetic Resonance (CMR) has recently emerged as an
area of interest for quantification of regional cardiac function from balanced,
steady state free precession (SSFP) cine sequences. However, currently
available techniques lack full automation, limiting reproducibility. We propose
a fully automated technique whereby a CMR image sequence is first segmented
with a deep, fully convolutional neural network (CNN) architecture, and
quadratic basis splines are fitted simultaneously across all cardiac frames
using least squares optimization. Experiments are performed using data from 42
patients with hypertrophic cardiomyopathy (HCM) and 21 healthy control
subjects. In terms of segmentation, we compared state-of-the-art CNN
frameworks, U-Net and dilated convolution architectures, with and without
temporal context, using cross validation with three folds. Performance relative
to expert manual segmentation was similar across all networks: pixel accuracy
was ~97%, intersection-over-union (IoU) across all classes was ~87%, and IoU
across foreground classes only was ~85%. Endocardial left ventricular
circumferential strain calculated from the proposed pipeline was significantly
different in control and disease subjects (-25.3% vs -29.1%, p = 0.006), in
agreement with the current clinical literature.
Comment: Accepted to Functional Imaging and Modeling of the Heart (FIMH) 2017
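The spline stage can be sketched with a standard Cox-de Boor basis and one linear least-squares solve. The knot layout and function names below are illustrative assumptions, and the real pipeline fits splines jointly across all cardiac frames rather than one contour at a time:

```python
import numpy as np

def bspline_basis(t, knots, degree=2):
    """Cox-de Boor recursion: evaluate all B-spline basis functions at t."""
    t = np.asarray(t, dtype=float)
    n = len(knots) - degree - 1
    B = np.zeros((len(t), len(knots) - 1))
    for i in range(len(knots) - 1):
        B[:, i] = (knots[i] <= t) & (t < knots[i + 1])
    for d in range(1, degree + 1):
        Bn = np.zeros((len(t), len(knots) - d - 1))
        for i in range(len(knots) - d - 1):
            left = knots[i + d] - knots[i]
            right = knots[i + d + 1] - knots[i + 1]
            if left > 0:
                Bn[:, i] += (t - knots[i]) / left * B[:, i]
            if right > 0:
                Bn[:, i] += (knots[i + d + 1] - t) / right * B[:, i + 1]
        B = Bn
    return B[:, :n]

def fit_contour(points, n_ctrl=6, degree=2):
    """Least-squares quadratic B-spline fit to sampled contour points.

    Given (m, 2) boundary points from the CNN segmentation, solve for
    control points so the spline approximates them.
    """
    m = len(points)
    u = np.linspace(0, 1, m, endpoint=False)        # curve parameters
    knots = np.concatenate([np.zeros(degree),
                            np.linspace(0, 1, n_ctrl - degree + 1),
                            np.ones(degree)])       # clamped knot vector
    B = bspline_basis(u, knots, degree)             # (m, n_ctrl)
    ctrl, *_ = np.linalg.lstsq(B, np.asarray(points, float), rcond=None)
    return ctrl                                     # (n_ctrl, 2)
```

Because the fit is linear in the control points, stacking the basis matrices for all frames yields exactly the kind of single least-squares problem the abstract describes.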
Multi-modal classifiers for open-vocabulary object detection
The goal of this paper is open-vocabulary object detection (OVOD) – building a model that can detect objects beyond the set of categories seen at training, thus enabling the user to specify categories of interest at inference without retraining the model. We adopt a standard two-stage object detector architecture, and explore three ways of specifying novel categories: via language descriptions, via image exemplars, or via a combination of the two. We make three contributions: first, we prompt a large language model (LLM) to generate informative language descriptions for object classes, and construct powerful text-based classifiers; second, we employ a visual aggregator on image exemplars that can ingest any number of images as input, forming vision-based classifiers; and third, we provide a simple method to fuse information from language descriptions and image exemplars, yielding a multi-modal classifier. Evaluating on the challenging LVIS open-vocabulary benchmark, we demonstrate that: (i) our text-based classifiers outperform all previous OVOD works; (ii) our vision-based classifiers perform as well as the text-based classifiers of prior work; (iii) our multi-modal classifiers perform better than either modality alone; and finally, (iv) our text-based and multi-modal classifiers yield better performance than a fully-supervised detector.
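A minimal sketch of classifier construction and fusion, assuming per-category text embeddings and image-exemplar embeddings are already available. The paper's visual aggregator and fusion are learnt modules; the mean-and-average scheme below is a hypothetical simplification:

```python
import numpy as np

def fuse_classifiers(text_emb, exemplar_embs):
    """Build a multi-modal classifier vector for one category.

    Sketch: the vision-based classifier is the mean of L2-normalised
    image-exemplar embeddings; fusion is a simple average with the
    text-based classifier, re-normalised to unit length.
    """
    t = text_emb / np.linalg.norm(text_emb)
    v = np.asarray(exemplar_embs, dtype=float)
    v = (v / np.linalg.norm(v, axis=1, keepdims=True)).mean(axis=0)
    v /= np.linalg.norm(v)
    f = (t + v) / 2
    return f / np.linalg.norm(f)

def classify(region_feat, classifiers):
    """Assign a region feature to the category with highest cosine similarity."""
    r = region_feat / np.linalg.norm(region_feat)
    C = np.stack(classifiers)
    return int(np.argmax(C @ r))
```

Because the classifiers are just unit vectors in the detector's embedding space, new categories can be added at inference by building a new vector, with no retraining.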
Inducing Predictive Uncertainty Estimation for Face Recognition
Knowing when an output can be trusted is critical for reliably using face
recognition systems. While there has been enormous effort in recent research on
improving face verification performance, understanding when a model's
predictions should or should not be trusted has received far less attention.
Our goal is to assign a confidence score for a face image that reflects its
quality in terms of recognizable information. To this end, we propose a method
for generating image quality training data automatically from 'mated-pairs' of
face images, and use the generated data to train a lightweight Predictive
Confidence Network, termed PCNet, for estimating the confidence score of a
face image. We systematically evaluate the usefulness of PCNet with its error
versus reject performance, and demonstrate that it can be universally paired
with and improve the robustness of any verification model. We describe three
use cases on the public IJB-C face verification benchmark: (i) to improve 1:1
image-based verification error rates by rejecting low-quality face images; (ii)
to improve quality score based fusion performance on the 1:1 set-based
verification benchmark; and (iii) its use as a quality measure for selecting
high quality (unblurred, good lighting, more frontal) faces from a collection,
e.g. for automatic enrolment or display.
Comment: To appear at the British Machine Vision Conference (BMVC), 2020
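The error-versus-reject evaluation can be sketched directly: sort samples by predicted confidence, reject the lowest-confidence fraction, and measure the error rate on what remains. This is a simplified protocol sketch, not the paper's exact evaluation code:

```python
import numpy as np

def error_versus_reject(confidences, correct, reject_fractions):
    """Error rate on retained samples after rejecting the
    lowest-confidence fraction of the data."""
    order = np.argsort(confidences)        # ascending: reject these first
    correct = np.asarray(correct, dtype=float)[order]
    n = len(correct)
    errors = []
    for f in reject_fractions:
        kept = correct[int(round(f * n)):]
        errors.append(1.0 - kept.mean() if len(kept) else 0.0)
    return errors
```

A useful quality score makes this curve fall steeply: rejecting a small, low-confidence slice should remove a disproportionate share of the errors.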
Turbo Training with Token Dropout
The objective of this paper is an efficient training method for video tasks.
We make three contributions: (1) We propose Turbo training, a simple and
versatile training paradigm for Transformers on multiple video tasks. (2) We
illustrate the advantages of Turbo training on action classification,
video-language representation learning, and long-video activity classification,
showing that Turbo training can largely maintain competitive performance while
achieving almost 4X speed-up and significantly less memory consumption. (3)
Turbo training enables long-schedule video-language training and end-to-end
long-video training, delivering performance competitive with or superior to
previous works, in settings that were previously infeasible to train under limited resources.
Comment: BMVC 2022
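Token dropout itself is simple to sketch: randomly keep a subset of patch tokens before they enter the transformer, so self-attention cost drops roughly quadratically with the keep ratio. This is a hypothetical sketch; the paper's sampling policy may differ:

```python
import numpy as np

def drop_tokens(tokens, keep_ratio, rng):
    """Randomly keep a subset of tokens (token-dropout sketch).

    tokens: (n, d) patch tokens; returns roughly keep_ratio * n of them
    (order preserved), cutting the cost of the transformer that follows.
    """
    tokens = np.asarray(tokens)
    n = len(tokens)
    k = max(1, int(round(keep_ratio * n)))
    idx = rng.choice(n, size=k, replace=False)
    return tokens[np.sort(idx)]
```

With a keep ratio of about 0.5, attention FLOPs fall by roughly 4x, which is consistent with the near-4X speed-up reported above.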